A new method for font compression is introduced and compared to existing methods. A
very compact representation is achieved by using a variant of McCreight's string matching
algorithm to compress the bounding contour. Results from an actual implementation are
given showing the improvement over other methods and how this varies with resolution and
character complexity. Compression ratios of up to 150 are achieved for Chinese characters.

0. Introduction

With the development of computer typesetting it has become important to find efficient
ways to represent fonts in computers. We can achieve good results at a reasonable cost
by  adapting  ideas  from  [6]  to  improve  upon  existing  font  compacting  schemes.  We  compare 
an  actual  implementation  of  this  new  scheme  with  an  implementation  of  the  existing  idea 
of  contour  coding.  Data  from  Chinese  characters  has  been  given  special  attention  in  our 
experiments  because  such  characters  are  particularly  challenging;  the  new  scheme  can  of 
course  be  applied  to  Western  alphabets,  where  its  performance  is  even  better. 

Storing  and  manipulating  thousands  of  high  resolution  Chinese  characters  can  be  quite 
expensive. For example, the [Chinese title illegible], a large dictionary of words and phrases, contains 14,782
characters. The [Chinese title illegible], a dictionary for daily use, contains about 11,100 characters
including some modified and simplified characters. There are about 8,000 commonly used
characters. A sample of 21,629,372 characters [1] from writings on political theory, science,
literature and the arts, and in newspapers has shown that 100 characters are used 40% of
the time, 250 characters are used 60% of the time, 500 characters 80%, 1,000 characters
90%,  2,000  characters  98%,  and  3,000  characters  more  than  99%.  A  Chinese  computer 
typesetting  system  needs  hundreds  of  thousands  of  characters  including  all  the  different 
sizes  in  four  different  styles. 

In  the  first  section  we  give  a  brief  overview  of  the  various  methods  for  encoding 
fonts.  We  describe  in  detail  a  method  of  contour  coding.  This  is  particularly  important 
because  the  new  string  matched  scheme  is  based  on  it  and  the  contour  coding  method 
is  used  for  comparison.  Finally,  we  explain  the  format  for  string  matched  compression. 
String  matched  compression  depends  on  being  able  to  represent  variable  length  numbers 
efficiently.  The  second  section  describes  a  rather  interesting  approach  to  this  problem. 
Next,  we  present  a  fast  algorithm  based  on  [6]  for  actually  doing  string  matched  font 
compression.  We  then  give  results  including  the  performance  of  the  method  on  a  variety  of 
characters,  its  dependence  on  the  size  and  complexity  of  the  characters,  and  a  comparison 
with  other  methods.  These  results  look  promising  but  there  is  room  for  improvement.  In 
the  final  section  we  deal  with  these  and  present  empirical  results. 

1.  Methods  for  Character  Encoding 

Bitmap  encoding 

The  bitmap  is  a  simple  method  for  encoding  characters  where  one  bit  is  used  to 
specify  the  contents  of  each  pixel.  The  advantages  of  this  method  are  that  decoding  speed 
is  maximal  and  only  a  small  buffer  is  needed.  Since  the  size  of  the  code  grows  as  the 
square  of  the  linear  resolution,  bitmap  encoding  is  best  suited  to  low  resolution.  Using  this 
method,  experimental  Chinese  computer  systems  have  been  set  up  that  print  characters  of 
16  X  16  and  24  X  24  pixels. 

Run  length  encoding 

Run length encoding uses rectangular areas one bit thick called runs. It is shown
in [2] that if n is the size of the matrix, run length encoding produces output of length
$O(kn \log n)$, where k is the number of runs per scan line. The factor k is really a measure
of the complexity of the character pattern. For Roman fonts, k is approximately 4, while
it is more than 10 for most Chinese characters. For 10.5 point Chinese characters of size
n = 128, the compression ratio over bitmap encoding is about
$n^2 / (kn \lg n) = n / (k \lg n) \approx 1.8$. We
can see that run length encoding is not particularly effective for typical Chinese character
typesetting.  However,  run  length  encoding  has  the  advantage  that  it  is  almost  as  fast  as 
bitmap  encoding  and  it  also  requires  very  little  buffer  space. 
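
The run length estimate can be checked numerically. The following is a minimal sketch, assuming the constant hidden in the $O(kn \log n)$ bound is 1, that logarithms are base 2, and that a "run" is a maximal stretch of equal pixels; the function names are illustrative and do not come from [2].

```python
import math

def runs_per_line(row):
    """Count maximal runs of equal pixels in one scan line (a list of 0/1)."""
    if not row:
        return 0
    return 1 + sum(1 for a, b in zip(row, row[1:]) if a != b)

def rle_ratio(n, k):
    """Bitmap size n*n divided by the run length estimate k*n*lg(n)."""
    return (n * n) / (k * n * math.log2(n))
```

With n = 128 and k = 10 the ratio is 128/70, which rounds to the figure of 1.8 quoted in the text.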

Differential  run  length  encoding 

This  method  differs  from  run  length  encoding  in  that  we  represent  the  positions  of 
a  stroke  boundary  on  a  scan  line  by  the  displacement  from  the  boundary  points  on  the 
adjacent  scan  line.  This  method  uses  only  O(kn)  space  for  n  X  n  bitmaps  since  the 
displacement  from  adjacent  scan  lines  is  usually  small. 

Contour  coding 

The  basic  idea  behind  the  type  of  contour  coding  we  use  here  is  that  the  black  pixels 
are  surrounded  by  a  bounding  contour.  The  contour  can  be  restricted  to  an  orthogonal 
grid  as  illustrated  below. 


In this example, we give the coordinates of all the black pixels and also of certain points
on the bounding contour. If a point on the boundary has coordinates (i, j), then the pixel
at its upper right also has coordinates (i, j) but the pixel at its lower left has coordinates
(i − 1, j − 1). This means that the maximum possible coordinates for points on the bounding
contour are always one more than the coordinates of the upper right hand pixel. Contour
coded files begin with a header giving the number of contours, N, the size of the character
box, and the position of the contours within the character box. The specification of the
actual contours follows. The file is thought of as a bit string with word boundaries ignored,
and it can be diagrammed as follows.

$$\langle N\rangle\langle x_{max}\rangle\langle y_{max}\rangle\; x_1 y_1 d_1 \ldots x_N y_N d_N\; \sigma_0 \sigma_1 \ldots \sigma_{n-1} \qquad (1)$$

The character box contains pixels with x coordinates ranging from 0 to $x_{max}$ and y
coordinates ranging from 0 to $y_{max}$. The bounding contour for the ith boundary starts
at $(x_i, y_i)$ in direction $d_i$. Quantities in angle brackets represent variable length numbers.
(See below.) The variables $x_i$, $y_i$, and $d_i$ are fixed length numbers and each $\sigma_i$ represents
either "left", "right", or "straight". The character in the above figure could be represented
with N = 1, $x_{max} = 1$, $y_{max} = 2$, $x_1 = 2$, $y_1 = 3$, $d_1$ = west, and $\sigma_0 \ldots \sigma_{11}$ =
SLSSLSLLRRLL. We can tell when a boundary ends by when it closes on itself. If N > 1
the string $\sigma_0 \ldots \sigma_{n-1}$ can be broken into N pieces, each of which forms a closed path. The
starting coordinates $x_i$ and $y_i$ require $1 + \lfloor \lg(x_{max} + 1) \rfloor$ bits and $1 + \lfloor \lg(y_{max} + 1) \rfloor$ bits
respectively, and the initial directions $d_i$ will fit in 2 bits. The boundary turns $\sigma_j$ are
encoded into 1 or 2 bits as follows.

R ↦ 00, L ↦ 01, S ↦ 1

This  header  information  is  not  encoded  in  a  particularly  efficient  manner.  One  could 
envision ways of encoding $x_i$, $y_i$, and $d_i$ more efficiently by picking starting points with
special  properties.  For  Chinese  characters  where  there  can  be  many  boundaries  some 
differential  scheme  might  be  tried.  All  these  methods  are  awkward  and  have  little  effect 
because  the  header  is  much  smaller  than  the  rest  of  the  file  anyway. 
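
The turn code and the field widths of the contour coded format can be illustrated with a short sketch. It assumes the code R = 00, L = 01, S = 1 and the coordinate width of 1 + ⌊lg(max + 1)⌋ bits described above; the function names are hypothetical, and bits are kept as a text string for readability rather than packed into words.

```python
TURN_CODE = {"R": "00", "L": "01", "S": "1"}

def coord_bits(max_value):
    """Field width 1 + floor(lg(max_value + 1)) used for a starting coordinate."""
    return (max_value + 1).bit_length()

def encode_turns(turns):
    """Concatenate the 1 or 2 bit codes for a string over {L, R, S}."""
    return "".join(TURN_CODE[t] for t in turns)

def decode_turns(bits):
    """Invert encode_turns; the code is prefix free, so one pass suffices."""
    out, i = [], 0
    while i < len(bits):
        if bits[i] == "1":
            out.append("S")
            i += 1
        else:
            out.append("R" if bits[i + 1] == "0" else "L")
            i += 2
    return "".join(out)
```

For the example contour SLSSLSLLRRLL this gives 20 bits for the 12 turns, since the four straight moves cost one bit each and the eight turns cost two.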


String matched contour coding

String matched contour coding is a slight modification of the above scheme where
$\sigma_0 \ldots \sigma_{n-1}$ are encoded differently. A string matched contour coded file is

$$\langle N\rangle\langle x_{max}\rangle\langle y_{max}\rangle\; x_1 y_1 d_1 \ldots x_N y_N d_N\; \tau_1\langle k_1\rangle p_1 r_1\; \tau_2\langle k_2\rangle p_2 r_2 \ldots \tau_m\langle k_m\rangle p_m r_m \qquad (2)$$

where

$$E(q) = q - 1 + \sum_{i<q} k_i, \qquad \tau_q = \sigma_{E(q)}, \qquad \text{and}$$

$$\sigma_{j+1+E(q)} = \begin{cases} \sigma_{j+p_q}, & \text{if } r_q = 0, \\ \mathrm{rev}(\sigma_{j+p_q}), & \text{if } r_q = 1; \end{cases} \qquad 0 \le j < k_q.$$

The function rev just switches left and right (i.e., rev(L) = R, rev(R) = L, and rev(S) =
S). The ith 4-tuple $\tau_i\langle k_i\rangle p_i r_i$ means "The next character in the string $\sigma_0 \ldots \sigma_{n-1}$ is $\tau_i$.
Now start at $\sigma_{p_i}$ and copy the next $k_i$ characters. If $r_i = 1$ then switch left and right as
you do the copying." The $\tau_i$ could be encoded just as the $\sigma_i$ for contour coded files but
this would be inefficient because we can assume that $\tau_q$ differs from the turn that would
have extended the previous match. (Otherwise we could use a larger value for $k_{q-1}$.) The
first turn $\tau_1$ is therefore encoded as the $\sigma_i$ above, while the subsequent $\tau_i$ require only
1 bit. The numbers $k_i$ are variable length while the $p_i$ can be fixed length numbers because
we require that $0 \le p_i < E(i)$. Finally, the $r_i$ are only 1 bit long.

2. Encoding Variable Length Numbers

In the above encodings we need some way of encoding arbitrarily large integers
efficiently. All encodings must have unique prefixes so that we can tell where they end.
This clearly requires longer codes for larger integers. Simple schemes for doing this are
suggested in [5] and [9]. These schemes are asymptotically optimal; however, they behave
poorly for numbers of the size that our $k_i$ are likely to be. To combat this we look at the
probability distribution we get for the $k_i$ in a typical character. A capital "S" 537 pixels
high had the distribution shown in the table below. The column labeled "occurrences" gives the
total number of different i for which $k_i$ was in the specified range. In the column labeled
"normalized occurrences" this is divided by the size of the range.

The "ideal" way to encode $k_i$ would be to use Huffman's minimum weighted path
length algorithm to find the encoding that minimizes the expected average length of the
$k_i$. We must modify this approach if we want the encoding to be efficient for all the
distributions of $k_i$ we are likely to encounter. We would also like there to be a simple
pattern that would allow us to compute the encoding for arbitrary integers. For these
reasons, we abstract the frequency information in the table and pretend that the actual
normalized occurrences are constant up to a certain value (say about 20) and then they fall
off as $1/x^{\alpha}$. This assumption could be modified without too much difficulty. We will see
below that it is very convenient to deal with probabilities falling off as $1/x^{\alpha}$. In addition,
this provides a reasonable fit with the above data for large $k_i$, with $\alpha \approx 2.75$. We can
adjust this parameter to obtain somewhat more reasonable behavior for large numbers.

      range of k_i    occurrences    normalized occurrences
      3                    1               1.0
      4-5                  2               1.0
      6-7                  2               1.0
      8-10                14               4.7
      11-15               30               6.0
      16-22               41               5.9
      23-31               25               2.8
      32-44               15               1.14
      45-63                6               0.32
      64-90                4               0.15
      91-127               1               0.03
      128-180              1               0.02
      181-255              0               0
      256-361              1               0.01

We can derive the following encoding procedure from these assumptions.

function encode(n);
begin
    r ← r0;
    e ← e0;
    if n < t
    then if n < c
        then return the lower e − 1 bits of n
        else return the lower e bits of n + c
    else begin
        x ← t;
        d ← d0;
        while n ≥ x + d
        do begin
            e ← e + 1;
            x ← x + d;
            r ← 2(r − d);
            d ← ⌊rb + .5⌋;
        end;
        return the lower e bits of (2^e − r + n − x)
    end;
end;

For the above function we assume the following initialization has taken place.

integer r0, e0, c;
real b;

b ← (1 − s/2);
r0 ← ⌊t(s − 1)/b + 0.5⌋;
e0 ← ⌊lg(r0 + t − 1)⌋ + 1;
c ← 2^e0 − r0 − t;
d0 ← ⌊b r0 + 0.5⌋;

The process is controlled by the constants t and s. All numbers less than t are given nearly
equal length encodings. Beyond t, encodings get one bit longer each time the number
increases by a factor of s. We require that ts/(2 − s) be at most t less than a power of two
so that c < t. The variable e is the length of the encoding. We use encodings that appear
in numerical order when viewed as binary fractions. As numbers get bigger their encodings
acquire more and more leading ones. The variable r gives the number of codes of length
e that could be used without interfering with the encodings for smaller numbers. We can
achieve the constant factor s by using up a fixed fraction, b, of the available encodings at
each step. At the end of each iteration of the while loop, x is the smallest number whose
encoding is e bits long and there will be exactly d different numbers with encodings e bits
long. We can see that for large n,

$$s = \frac{x + d}{x} = 2\,\frac{r - d}{r}$$

is stable. This holds when b = 1 − s/2 as set in the initialization block. A minimum
weighted path length encoding for a probability distribution proportional to $1/x^{\alpha}$ would
exhibit this kind of behavior if $r/2^e$ is proportional to $\int_x^{\infty} 1/t^{\alpha}\,dt$. This holds when
$s/2 = s^{1-\alpha}$, i.e., $s = 2^{1/\alpha}$.

Consider the case where t = 10 and s = 1.7. This gives b = .15, r0 = 47, e0 = 6,
and c = 7. We get the encodings

       n    encoding
       0    00000
       ...
       6    00110
       7    001110
      16    010111
      17    0110000
      28    0111011
      29    01111000
      48    10001011
      49    100011000
       ...

If we run the encoding algorithm on n = 314, for example, we get

       e      x       r      d    (x + d)/x
       6     10      47      7      1.700
       7     17      80     12      1.706
       8     29     136     20      1.690
       9     49     232     35      1.714
      10     84     394     59      1.702
      11    143     670    101      1.706
      12    244    1138    171      1.701

which makes the encoding of 314 equal to the lower 12 bits of 2^12 − 1138 + 314 − 244 or

101111010100.

Decoding  variable  length  numbers 

The process of decoding variable length numbers in the above format is quite straightforward.
Since speed is important in the decoding process, it may be desirable to use
tables  to  find  the  values  of  k  and  x  obtained  below.  Here  we  give  a  version  of  the  decoding 
algorithm  that  uses  no  such  tables.  This  is  more  instructive  and  still  quite  fast. 

function decode;
begin
    read e0 − 1 bits into m;
    if m < c then return(m);
    m ← 2m + (read in the next bit);
    n ← m − c;
    if n < t then return(n);
    e ← e0;
    x ← t;
    r ← r0;
    d ← d0;
    k ← 2^e − r − x;
    do begin
        if e > e0
            then m ← 2m + (read in the next bit);
        n ← m − k;
        e ← e + 1;
        x ← x + d;
        r ← 2(r − d);
        d ← ⌊rb + .5⌋;
        k ← 2^e − r − x;
    end
    until n < x;
    return(n);
end;


All  the  variables  in  the  decoding  algorithm  are  analogous  to  the  identically  named  variables 
in  the  encoding  algorithm.  The  variable  k  keeps  track  of  the  quantity  that  would  have 
been  subtracted  from  n  if  the  encoding  algorithm  had  ended  on  the  current  step.  The 
variables  m  and  n  are  the  number  read  from  the  input  so  far  and  how  it  would  be  decoded 
if  we  had  read  the  correct  number  of  bits  from  the  input.  When  x  surpasses  this  number 
it  tells  us  we  have  read  enough  bits. 
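
The encoding and decoding procedures can be exercised together. The sketch below is one possible transcription into Python, not the original implementation: it assumes s is supplied as an exact fraction so that the rounding ⌊rb + .5⌋ is not disturbed by floating point error, it represents code words as strings of '0' and '1', and the names `make_params`, `encode`, `decode`, and `PARAMS` are illustrative.

```python
from fractions import Fraction
from math import floor

def make_params(t, s):
    """Initialization block; s should be a Fraction so rounding is exact."""
    b = 1 - s / 2
    r0 = floor(t * (s - 1) / b + Fraction(1, 2))
    e0 = (r0 + t - 1).bit_length()          # floor(lg(r0 + t - 1)) + 1
    c = 2 ** e0 - r0 - t
    d0 = floor(b * r0 + Fraction(1, 2))
    return {"t": t, "b": b, "r0": r0, "e0": e0, "c": c, "d0": d0}

def encode(n, P):
    """Return the code word for n >= 0 as a string of '0'/'1'."""
    r, e = P["r0"], P["e0"]
    if n < P["t"]:
        if n < P["c"]:
            return format(n, "0%db" % (e - 1))
        return format(n + P["c"], "0%db" % e)
    x, d = P["t"], P["d0"]
    while n >= x + d:
        e += 1
        x += d
        r = 2 * (r - d)
        d = floor(r * P["b"] + Fraction(1, 2))
    return format(2 ** e - r + n - x, "0%db" % e)

def decode(bits, P):
    """Decode one number from an iterator over '0'/'1' characters."""
    m = 0
    for _ in range(P["e0"] - 1):
        m = 2 * m + int(next(bits))
    if m < P["c"]:
        return m
    m = 2 * m + int(next(bits))
    n = m - P["c"]
    if n < P["t"]:
        return n
    e, x, r, d = P["e0"], P["t"], P["r0"], P["d0"]
    k = 2 ** e - r - x
    while True:
        if e > P["e0"]:
            m = 2 * m + int(next(bits))
        n = m - k                # value if the code word ended at this length
        e += 1
        x += d
        r = 2 * (r - d)
        d = floor(r * P["b"] + Fraction(1, 2))
        k = 2 ** e - r - x
        if n < x:
            return n

PARAMS = make_params(10, Fraction(17, 10))   # the t = 10, s = 1.7 example
```

Calling `encode(314, PARAMS)` reproduces the 12 bit code word from the worked example, and `decode(iter(code), PARAMS)` recovers the number from it. A table driven decoder, as suggested in the text, would replace the loop in `decode` with lookups of k and x indexed by code length.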

3.  Data  Compression  Algorithms 

We  will  concentrate  mainly  on  the  method  for  creating  string  matched  contour  coded 
files  given  the  contour  information.  We  can  easily  find  the  contour  information  by  using 
the  fact  that  the  contour  contains  a  line  between  (x,  y)  and  (x,  y  + 1)  if  and  only  if  the  pixel 
at  (x,  y)  differs  from  that  at  (x  —  1,  y )  and  similarly  we  look  at  pixels  (x,  y)  and  (x,  y  -  1) 
to  see  if  the  contour  contains  the  line  between  (x,y)  and  (x  +  1,  y).  We  can  therefore  use 
shifting  and  masking  instructions  to  avoid  looking  at  the  pixels  individually. 
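
The shifting and masking idea can be sketched with one integer per scan line. This is an illustration under assumed conventions that the text does not fix: bit x of `rows[y]` holds pixel (x, y), bit 0 is the leftmost pixel, and everything outside the character box is white. The function name is hypothetical.

```python
def contour_edges(rows):
    """rows[y] is an int whose bit x is pixel (x, y).

    Returns (vertical, horizontal). Bit x of vertical[y] is set when the contour
    contains the line from (x, y) to (x, y + 1), i.e. pixels (x, y) and (x - 1, y)
    differ. Bit x of horizontal[y] is set when the contour contains the line from
    (x, y) to (x + 1, y), i.e. pixels (x, y) and (x, y - 1) differ.
    """
    padded = [0] + list(rows) + [0]          # white rows below and above the box
    vertical = [r ^ (r << 1) for r in rows]
    horizontal = [padded[y + 1] ^ padded[y] for y in range(len(rows) + 1)]
    return vertical, horizontal
```

Each XOR compares a whole scan line with its neighbor at once, so no pixel is examined individually.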

For the final step of the data compression process, we will assume that the contour,
$\sigma_0, \ldots, \sigma_{n-1}$, has been put into an array and that our goal is to find $\tau_i$, $k_i$, $p_i$, and $r_i$ for
$i = 1, \ldots, m$. For convenience we will refer to the values of $\sigma_i$ as L, R, or S, and the values
of $r_i$ as "true" or "false". Let $\sigma_{i,j}$ be the substring $\sigma_i \sigma_{i+1} \ldots \sigma_j$ with the understanding that
$\sigma_{i,j}$ denotes the empty string when i > j. Recall that E(i) is the number of characters
matched by $\tau_1\langle k_1\rangle p_1 r_1 \ldots \tau_{i-1}\langle k_{i-1}\rangle p_{i-1} r_{i-1}$. We will use the function rev($\sigma$) described
above and define rev($\sigma_{i,j}$) = rev($\sigma_i$) ... rev($\sigma_j$). In order to find the shortest encoding (2)
we must successively find the position $p_i$ where $\sigma_{p_i} \sigma_{p_i+1} \ldots$ matches as much as possible
of $\sigma_{E(i)+1} \sigma_{E(i)+2} \ldots$ for $i = 1, \ldots, m$. A straightforward implementation would be $O(n^2)$
but [6] gives a linear time algorithm that can be adapted to this problem. We will show
how  to  use  this  algorithm  to  produce  the  4-tuples  in  (2). 
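
Before turning to the suffix tree method, the matching problem itself can be made concrete with a brute force search. The sketch below is a stand-in for illustration only and is not the linear time algorithm of [6]: it tries every earlier position p < E(i), with and without rev, and keeps the longest match. The tuple layout `(tau, k, p, r)` and the function names are assumptions.

```python
REV = {"L": "R", "R": "L", "S": "S"}

def compress(sigma):
    """Greedy brute force search for the 4-tuples (tau, k, p, r) of format (2)."""
    n = len(sigma)
    tuples = []
    i = 0                                   # i plays the role of E(q)
    while i < n:
        tau = sigma[i]
        best = (0, 0, False)                # (k, p, r); k = 0 means nothing copied
        for p in range(i):                  # source must start before position E(q)
            for flip in (False, True):
                k = 0
                while i + 1 + k < n:
                    src = sigma[p + k]
                    if (REV[src] if flip else src) != sigma[i + 1 + k]:
                        break
                    k += 1
                if k > best[0]:
                    best = (k, p, flip)
        tuples.append((tau,) + best)
        i += best[0] + 1
    return tuples
```

As in the format definition, a copy may run past position E(q) into turns it has itself just produced, so a periodic contour such as `"SLSR" * 5` collapses to two tuples.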

It will be convenient to adopt some of the notation used in [6]. Let suf_i be $\sigma_{i,n-1}$
and let head_i be the longest prefix of suf_i that also occurs starting at position j, j < i.
(In other words, head_i = $\sigma_{i,i+k}$ = $\sigma_{j,j+k}$, where k is maximal and j < i.) Note that
the starting position j of the maximal string might not be unique. The algorithm in [6]
proceeds by constructing a suffix tree using McCreight's algorithm [7]. In our case we will
need two suffix trees, one containing all strings suf_i for 0 ≤ i ≤ n − 1 and the other
containing rev(suf_i) for 0 ≤ i ≤ n − 1. The algorithm in [6] starts with an empty tree and
inserts successively suf_0, suf_1, ..., suf_{n−1}. It determines $k_i$ and $p_i$ from the location where
suf_{E(i)} is inserted in the tree. The algorithm in [6] does not deal with anything equivalent
to our $r_i$.

We can run two copies of this algorithm in parallel inserting suf_0, suf_1, ..., suf_{n−1} into
suffix tree A and rev(suf_0), rev(suf_1), ..., rev(suf_{n−1}) into suffix tree B. We can then set
$k_i$ and $p_i$ either according to where suf_i is inserted in tree A or where rev(suf_i) is inserted
in tree B. We choose the option that results in the largest value of $k_i$. If we use the first
option then $r_i$ is false and otherwise $r_i$ is true. Using this method the algorithm
below produces the 4-tuples $\tau_1\langle k_1\rangle p_1 r_1 \ldots \tau_m\langle k_m\rangle p_m r_m$ from (2).

begin
    i ← 0; j ← 0;
    while i < n
    do begin
        τ ← σ_i;
        if i = n − 1 then output(τ, 0, 0)
        else begin
            while j < i + 1
            do begin
                insert suf_j in suffix tree A;
                insert rev(suf_j) in suffix tree B;
                j ← j + 1;
            end;
            output(τ, k_{i+1}, p_{i+1}, r_{i+1});
        end;
        i ← i + k + 1;
    end;
end;

The algorithm presented in [6] finds $p_i$ and $k_i$ before $\tau_i$. This has the effect of making
$p_1$ and $k_1$ automatically 0 and complicates the termination conditions. It is necessary to
treat the last triple differently in this case to make sure that $k_m$ is not too large to allow
$\tau_m$ a meaningful value. In [6] (to simplify the presentation) it is assumed that $\sigma_{n-1} \ne \sigma_j$
for any j < n − 1. This ensures that no head_i = suf_i. In our case this assumption is not
necessary because if head_i = suf_i then i = E(m) and we are done.

The  decoding  algorithm 

The  decoding  problem  is  how  to  get  from  one  of  the  compressed  formats  (1)  or  (2)  back 
to  the  raw  bitmap.  Decoding  the  contour  coded  files  (1)  is  essentially  a  process  of  marching 
around  the  boundary  flipping  certain  bits  in  the  bitmap.  Decoding  string  matched  files 
(2)  combines  this  with  decoding  variable  length  numbers  and  copying  parts  of  a  buffer. 
We can conceptually break these problems into two pieces: reading (2) and filling a buffer
array with $\sigma_0 \ldots \sigma_{n-1}$, and combining the information in such a buffer with that in the
preamble to obtain a raw bitmap. (Actually these problems are not entirely separate
because when solving the first subproblem the information in the preamble does not tell
us when to stop without tracing the contours to see when they close on themselves.)

In going from the string matched format to the pure contour format we will take as
given the ability to get fixed format and variable length numbers from the input file. The
algorithm is then very simple. (See below.) We refer to the entries of the buffer we
are generating as $\sigma_0 \ldots \sigma_{n-1}$.

The  next  subproblem  is  going  from  the  buffer  containing  the  bounding  contours  to  the 
raw  bitmap.  It  is  easy  enough  to  follow  the  contour  keeping  track  of  the  current  position 
and  direction.  One  has  to  study  the  formats  described  above  to  avoid  “off  by  one  errors”. 
For each vertical edge in the contour between $(x_0, y)$ and $(x_0, y + 1)$ we have to flip all the
bits in locations (x, y) where $x \ge x_0$. We can do this quickly by having another bitmap
with one bit for each word in the original. We then flip the right hand part of the word
containing the pixel $(x_0, y)$ and the right hand part of the word in the auxiliary table
corresponding  to  the  y  row.  (We  assume  a  two  level  hierarchy  is  sufficient.)  After  all 
the  bounding  contours  have  been  traced  we  need  one  additional  pass  to  flip  all  the  words 
whose  bits  in  the  auxiliary  are  still  on. 
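
The flipping rule can be sketched without the word level auxiliary table. The following is a simplified illustration, assuming each row fits in a single Python integer with bit x holding pixel (x, y); it applies the full flip for every vertical edge directly, so it shows the parity argument but not the two level speedup described above. The function name and argument layout are hypothetical.

```python
def fill_from_vertical_edges(width, height, edges):
    """edges: iterable of (x0, y), one per vertical contour edge from (x0, y) to (x0, y + 1).

    Returns a list of rows; bit x of rows[y] is pixel (x, y).
    """
    full = (1 << width) - 1
    rows = [0] * height
    for x0, y in edges:
        rows[y] ^= full & ~((1 << x0) - 1)   # flip every pixel with x >= x0
    return rows
```

A pixel ends up black exactly when an odd number of vertical edges lie at or to the left of it on its row, which is why the order in which the contour visits the edges does not matter.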


begin
    i ← 0; j ← 1;
    repeat
        σ[i] ← τ[j];
        i ← i + 1;
        for k ← 0 thru k[j] − 1
            do σ[i + k] ← (if r[j] then rev(σ[p[j] + k])
                           else σ[p[j] + k]);
        i ← i + k[j];
        j ← j + 1;
    until done;

end; 
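
The buffer filling loop translates almost line for line into Python. In this sketch the 4-tuples are assumed to have been parsed already into a list of `(tau, k, p, r)` values, which sidesteps the "until done" test that in the real decoder depends on tracing the contours until they close; the names are illustrative.

```python
REV = {"L": "R", "R": "L", "S": "S"}

def expand(tuples):
    """tuples: list of (tau, k, p, r) with r a bool; returns the turn string."""
    sigma = []
    for tau, k, p, r in tuples:
        sigma.append(tau)                    # the literal turn tau_j
        for _ in range(k):                   # copy k_j turns starting at sigma[p_j]
            src = sigma[p]
            sigma.append(REV[src] if r else src)
            p += 1
    return "".join(sigma)
```

Because turns are appended one at a time, a copy whose source overlaps its destination works without special handling.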

Decoding  speed 

The above algorithm was implemented on a DEC-10 computer at Stanford. Some
attention was paid to the speed of the code but more use of assembly language and a few
other improvements could probably achieve significant gains. The program started from
the string matched format, created a buffer containing the bounding contours, and used
this to regenerate the bitmap. There are two stages to the computation: first the buffer is
created and used to scan the bitmap, and then the auxiliary table is used to regenerate
the rest of the bitmap. The complexity of the first stage is linear in the length of the input
(actually $O(n \lg n)$ in general), while the second stage takes $O(n^2)$.


      character (glyph illegible)    stage 1    stage 2    total
      1                              25 ms.     4 ms.      29 ms.
      2                              30 ms.     4 ms.      34 ms.
      3                              40 ms.     4 ms.      44 ms.
      4                              43 ms.     4 ms.      47 ms.
      5                              26 ms.     4 ms.      30 ms.
      6                              24 ms.     4 ms.      28 ms.
      7                              23 ms.     5 ms.      28 ms.
      8                              37 ms.     5 ms.      42 ms.
      9                              46 ms.     5 ms.      51 ms.
      10                             22 ms.     3 ms.      25 ms.

We can see that the quadratic term is still quite small at this resolution. Even at
four times this resolution about 60% of the time is still spent in stage 1. The above results
are  good  considering  the  amount  of  data  being  handled,  but  in  many  applications  faster 
decoding  may  be  required.  The  process  is  probably  simple  enough  to  put  into  hardware  if 
necessary. 

Compression  Results  for  Chinese  Characters 

The following table gives compression figures for a wide variety of Chinese characters.
These characters were designed with Tung Yun Mei's LCCD system [8]. All the results
given here were checked with the decoding algorithm to verify that the original bitmap

Chinese

character 

resolution 105 X 141†

resolution 203 X 275†

differential
run length‡

contour 

coded 

string 

matched 

contour 

coded 

string 

matched 

% 

7184 

1819 

1291 

3477 

1965 

2326 

1736 

2528 

t 

5504 

1411 

1243 

1856 

iB 

6416 

1301 

1031 

1315 

*a 

9792 

2026 

1453 

1911 

* 

6144 

1227 

991 

1448 

* 

1577 

1508 

2107 

It- 

1472 

911 

2412 

1028 

1710 

1211 

1592 

ft 

11888 

1615 

1363 

3143 

2144 

8336 

1652 

3779 

2557 

tt 

13584 

1962 

4171 

2983 

It 

12240 

2365 

2116 

• 

A 

7360 

1633 

1201 

1475 

» 

13744 

2333 

2208 

4459 

3199 

a 

13664 

2314 

2093 

2876 

n 

2236 

1992 

2896 

ft 

2245 

2084 

H  ■ 

2967 

ft 

10808 

1637 

1466 

3059 

it 

8128 

1441 

1299 

2881 

1936 

*s 

16928 

2485 

2264 

a 

1989 

1741 

3979 

2599 

ft 

14464 

2424 

2512 

3728 

« 

1835 

1512 

3409 

2001 

t: 

7312 

1503 

1377 

2704 

1944 

•?. 

1883 

1542 

3719 

2057 

:*\ 

u< 

2152 

1710 

3972 

2534 

14 

13552 

2275 

2185 

4553 

3381 

i’ll 

6736 

1601 

1161 

1550 

2083 

1791 

2549 

13728 

2105 

1795 

2706 

2257 

1963 

4544 

2889 

i 

12800 

2355 

1991 

, 

ft 

8224 

1613 

1327 

t£ 

10096 

1636 

1380 

1954 

?( 

13040 

2265 

15)89 

4590 

3013 

M 

9520 

1870 

1581 

3726 

2327 

$ 

2427 

2122 

4828 

3328 

M 

10896 

1497 

1210 

3171 

1909 

*2 

11520 

2413 

1937 

4594 

2987 

†Maximum size of bounding box over all characters

‡9 point Sung Dynasty style characters with 144 X 144 typeface

could actually be regenerated. The table entries give the number of bits needed to represent
each  of  the  characters  below  at  two  different  resolutions  and  in  three  different  formats. 
The  data  on  the  right  side  of  the  table  are  for  the  same  characters  at  about  twice  the  size. 
For  comparison  a  bitmap  encoding  would  require  at  least  105  X  141  or  14,805  bits  for  the 
lower  resolution  and  203  X  275  or  55,825  for  the  higher.  The  column  labeled  “differential 
run  length”  gives  actual  data  from  an  ideographic  image  digitizer.  It  refers  to  9  point  Sung 
Dynasty  style  Chinese  characters  from  [4].  The  type  face  for  these  characters  is  144  X  144 
pixels,  but  the  type  body  is  probably  about  the  same  size  for  the  other  characters.  The 
resolutions  stated  below  are  merely  for  the  minimum  size  bitmap  that  could  contain  all 
these  characters  (i.e.  type  body).  We  can  see  that  the  string  matched  format  represents 
these  characters  most  efficiently.  The  compression  ratio  relative  to  bitmap  encoding  ranges 
from  5.9  to  16.2  at  the  lower  resolution  and  from  16.5  to  54.3  at  the  higher  one.  The 
compression  ratio  relative  to  simple  contour  coding  ranges  from  .965  to  1.615  at  the  lower 
resolution  and  from  1.32  to  2.35  at  the  higher  one.  There  was  only  one  case  where  contour 
coding  outperformed  the  string  matched  format.  String  matched  encoding  compares  very 
favorably  with  the  differential  run  length  format  that  achieved  compression  ratios  relative 
to  the  bitmap  ranging  from  .87  to  2.03. 

The  string  matched  compression  format  is  particularly  well  suited  to  high  resolution 
characters.  The  relative  gain  from  the  string  matching  procedure  over  the  contour  coded 
format  increases  with  resolution.  In  addition  to  this,  the  relative  overhead  in  terms  of 
decoding  time  decreases.  The  table  below  also  refers  to  the  characters  we  designed  with 
the  LCCD  system.  Resolutions  are  measured  as  before.  Magnification  factors  of  1,  2,  3, 
and 4 were used to achieve the different size bitmaps referred to below.


contour coded bits vs. string matched bits at different resolutions

      Chinese        96 X 131           186 X 255          277 X 376          368 X 499
      character   contour  string    contour  string    contour  string    contour  string
      (illegible)  coded   matched    coded   matched    coded   matched    coded   matched
      1            1472     991       2412    1028       3418    1224       4384    1220
      2            1301    1031       2353    1315       3409    1483       4421    1733
      3            1633    1201       3064    1475       4466    1842       5866    1998
      4            1601    1161       3059    1550       4517    1904       6089    2358
      5            1883    1452       3719    2057       5625    2799       7417    3027
      6            1835    1512       3409    2001       5129    2506       6635    2701
      7            2257    1963       4544    2889       6598    3586       8694    4204
      8            2355    1991       4597    2902       6764    3454       9008    4146
      9            2005    1652       3779    2557       5593    3291       7302    3990
      10           2275    2185       4553    3381       6629    3853       8855    4269



Note that in the first line of the table, string matching actually produced a more compact
encoding at the highest resolution than at the second highest. It is interesting to compare
the string matched compression figures in the above table with other methods for
compressing the contour coded format. Recall that the contour coded format used a simple
Huffman code with R, L, and S encoded 00, 01, and 1 respectively. This scheme averages
about 1.58 bits per turn. A more complicated encoding using all strings of length six over
the alphabet {L, R, S} achieves at best .86 bit per turn for a saving of about a factor of
1.8. We can see that the string matched method easily surpasses this at high resolutions.
Furthermore the string matched format has much more potential for improvement.

It  is  interesting  to  see  how  the  string  matched  compression  method  responds  to 
the  different  kinds  of  strokes  used  in  Chinese  characters.  There  are  five  fundamental 
strokes: dots (丶), horizontal strokes (一), vertical strokes (丨), “Pie” strokes (丿), and
“Na” strokes (㇏). Chinese characters can be divided into two classes: those that consist
mainly of horizontal and vertical strokes, and those that consist mainly of dot, “Pie”,
and “Na” strokes. The former class is more common, occurring at a frequency of about
60%. Characters that consist mostly of horizontal and vertical strokes are in general
easier  to  compress,  but  the  string  matched  method  appears  to  respond  particularly  well  to 
them.  The  method  responds  reasonably  well  to  straight  lines  at  constant  slope  because  any 
such  line  will  become  a  repeating  pattern  of  letters  from  the  alphabet  {L,R,S}.  Curved 
boundaries  present  more  problems.  Compare  the  following  compression  results  for  isolated 
strokes.  (To  facilitate  a  more  direct  comparison  the  bit  counts  given  do  not  include  the 
preambles.) 

kind  of  stroke  bounding  box  contour  coded  string  matched 
horizontal  94  X  13  284  160 

vertical  11  X  105  248  128 

Pie  45  X  64  433  296 
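
The remark that a straight line at constant slope becomes a repeating pattern over {L, R, S} can be seen by generating the turns for a staircase boundary. This is a small sketch under an assumed convention that the boundary is a sequence of unit steps labeled E, N, W, S and that a turn is recorded between each pair of consecutive steps; the function names are hypothetical.

```python
DIRS = "ENWS"   # counterclockwise order, so advancing by one is a left turn

def turns_from_steps(steps):
    """Turn string (L, R, S) between consecutive unit steps given as E/N/W/S."""
    code = {0: "S", 1: "L", 3: "R"}          # a reversal (2) never occurs on a contour
    return "".join(
        code[(DIRS.index(b) - DIRS.index(a)) % 4]
        for a, b in zip(steps, steps[1:])
    )

def staircase(run, rise, repeats):
    """Unit steps approximating a line that climbs `rise` for every `run` across."""
    return ("E" * run + "N" * rise) * repeats
```

For a slope of 1/3 the turns repeat with period four (S, S, L, R), which is the kind of repetition a string matching compressor can replace with a single long copy.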

The  following  table  shows  how  this  relates  to  full  characters.  The  characters  on  the  left 
consist  entirely  of  horizontal  and  vertical  strokes;  those  on  the  right  have  mostly  Pie  and 
Na strokes. These characters are all at the same resolution we have been using: maximal
bitmap  size  105  X  141. 


Chinese 

character 

contour 

coded 

string 

matched 

Chinese 

character 

contour 

coded 

string 

matched 

dj 

854 

361 

A 

919 

659 

841 

470 

X 

1025 

1 

959 

A 

■s 

1115 

1243 

mSM 

* 

mSBM 

1105 

1 

1497 

707 

% 

1413 

1182 

s 

It is clear that the complexity of a character has a great effect on the number of
bits needed to represent it. To see how the characters above compare to other Chinese
characters we give the following table of the stroke count distribution of 6,763 frequently
used characters from [4].

      stroke count    number of characters    percentage
      1-3                     93                  1.4%
      4-6                    753                 11.1%
      7-9                   2014                 29.8%
      10-12                 2037                 30.2%
      13-15                 1183                 17.5%
      16-18                  499                  7.4%
      19-21                  145                  2.1%
      22-24                   35                  0.5%
      25-27                    4                  0.1%

Potential  Improvements 

There are several ways the compression figures for Chinese characters might be
improved. First of all, since the characters are composed of simple strokes it would be better
to represent characters as a union of strokes instead of just giving the overall outline. This
would avoid many discontinuities in the bounding contours. Secondly, we should take into
account the distribution of the $p_i$. They are most likely to refer to positions near the current
position (i.e., E(i)) and possibly tend to cluster in other places as well. This would be
particularly important if the characters were being expressed as the union of their strokes.
Some kind of escape sequence or simple paging scheme could probably do this without
much additional overhead. The preambles in the existing format are blatantly inefficient.
It would be particularly important to improve this if we had separate boundaries for each
stroke. Finally, if we relax our standards for perfect replication we could probably achieve
significant gains through a kind of smoothing process. In the string matching procedure
we could look for ways to lengthen the matches by changing a few boundary pixels. This
could be made not to damage the appearance of the characters and would achieve better
compression without increased decoding time.